-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Turn quadratic time on number of impl blocks into linear time #78317
Conversation
(rust_highfive has picked a reviewer for you, use r? to override) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just out of curiousity, how common is it to have 20+ impl blocks in real world code?
// faster asymptotic runtime. | ||
if impls.len() < 20 { | ||
for (i, &impl1_def_id) in impls.iter().enumerate() { | ||
for &impl2_def_id in &impls[(i + 1)..] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is how it is and there is no reason to change, but wouldn't cloning the iterator for the inner loop be nicer? There's no need to complicate the iteration when we only need to look ahead, I think.
Am I missing something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IDK for me as someone coming from C/C++, iterator magic is more complicated than doing things with indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also have a background in C and I hate working with indices when I can avoid them, they are too much a footgun :)
Probably quite rare, however note that impl blocks might be distributed across multiple files. IIRC servo has something like this. I might misremember though. There might be no such case in the perf's testsuite at all. I just stumbled onto this code and it felt bad. Anyways, the swap I add above should have an impact for smaller number of impl blocks already. Maybe two perf runs are needed? IDK. |
I guess rustc has a ton of impl TyCtxt (I counter 21 + a macro). |
@bors try @rust-timer queue |
Awaiting bors try build completion |
⌛ Trying commit c17fe5be24fb07fd8f122c5cfd5cb1726d685f35 with merge 925eba0a75e6e25806ed3a4a654610329bd00254... |
A single location... won't be much of an impact I guess especially considering the fact that for 20 the effect is still low, at least compared to what the rest of the compiler is doing. Also note that I pulled the 20 out of my nose. It's been chosen on gut feelings alone. If there is an impact, it might be improved. If there's none, maybe it was chosen wrongly. |
☀️ Try build successful - checks-actions |
Queued 925eba0a75e6e25806ed3a4a654610329bd00254 with parent 7bade6e, future comparison URL. |
cc #69009 |
@jonas-schievink nice, someone had the same bright idea as me! I'd attribute the regression in that PR to cases where the brute-force search is better. The most extreme instance of this is probably two impl blocks, a short one followed by a very long one. If the order is right, the current O(n^2) search runs once over one of the short impl block, while looking up into the larger block. The linear algorithm of your PR would put both impl blocks into the hashmap and cause a lot of bureaucracy. In fact, a single long impl block would favour the O(n^2) search even better as. This gives me lots of ideas to improve the algorithm... The perf run of your PR shows (next to the regressions) a 2% improvement on syn-opt which is IMO a great achievement. Ten 2% improvements constitute a 19% improvement, 30 such improvements cut the time in half. Very encouraging! |
Finished benchmarking try commit (925eba0a75e6e25806ed3a4a654610329bd00254): comparison url. Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. Please note that if the perf results are neutral, you should likely undo the rollup=never given below by specifying Importantly, though, if the results of this run are non-neutral do not roll this PR up -- it will mask other regressions or improvements in the roll up. @bors rollup=never |
Up to 3.1% improvements for incr-unchanged of packed-simd-check... very nice. |
I would not look at incremental benchmarks on PRs like this - they're not likely to be super meaningful as an expression of improvement (or regression) because they are skipping parts of the build, and unless we think this optimization permits skipping more (seems unlikely), the percent win there is just larger because the base is smaller. It looks like this is about a 1% win in practice instruction counts wise, but appears to be a regression on wall times (e.g. rustc middle regressed roughly 2 seconds, it looks like). I think we are least don't have solid evidence this is a win on the code examples tested by perf. I'm not sure if perhaps switching algorithms when we detect n > 30 or something like that makes sense? |
Maybe test the impact on a crate like https://github.com/stm32-rs/stm32-rs-nightlies/tree/master/stm32f4? I usually use those when working on the overlap checks. |
Oh right it's red down there... a bit unfortunate :/. I think I should split up the PR as a next step. It's doing two separate optimizations, better measure them separately. In any case, the improvement for packed-simd is real, as the @Mark-Simulacrum Do you know whether there is a way to show the query times for rustc-middle and other red crates? |
c17fe5b
to
4ff10e8
Compare
We don't currently record self profile data for rustc compilation. |
FYI if you're wondering, |
Wow that crate has approximately a thousand inherent impl blocks per device that you can enable, and in total there are 11 features... If you enable them all you get 14043 impl blocks... So it's definitely a good real case stress test for overlap checking. It also has a ton of name overlap, making the namespace checks very relevant. I think it's a good benchmark, thanks for suggesting it! |
Note that the title of this PR as well as of #69009 can be misunderstood, as the entire impl overlap check in the worst case still runs in That being said, the earlier code has been changed from quadratic to linear time, and often there is no overlap. There's probably some way to speed up things for |
ee55a93
to
73a7d93
Compare
The PR is ready for review! I have done some smaller fixes that made it a bit slower than the version I ran the benchmarks on, e.g. I added sorting to make the output not depend on hash set iterator ordering. I don't want to run all benchmarks yet again (as they are currently quite manual), but this is a table about a subset of the benchmarks:
Also updated the table of the PR description. |
@rustbot modify labels: +S-waiting-on-review -S-waiting-on-author |
@matthewjasper friendly ping! |
I would expect this to be the main reason rustc is slow on diesel and stm32: rust-lang/rustc-perf#807 (comment) |
@jyn514 this PR addresses the |
@bors r+ |
📌 Commit 73a7d93 has been approved by |
⌛ Testing commit 73a7d93 with merge 3ce7ca6d86c4fd9982a4f2effc438bbfcb203ab2... |
The job Click to see the possible cause of the failure (guessed by this bot)
|
💔 Test failed - checks-actions |
cc @rust-lang/infra, I've seen this on a few different PRs now. It seems to be spurious, retrying will usually fix it. @bors retry |
☀️ Test successful - checks-actions |
That seems to be a spurious network failure while downloading LLVM:
|
Previously, if you had a lot of inherent impl blocks on a type like:
The compiler would be very slow at processing it, because
an internal algorithm would run in O(n^2), where n is the number
of impl blocks. Now, we add a new algorithm that allocates but
is faster asymptotically.
Comparing rustc nightly with a local build of rustc as of this PR (results in seconds):
I've tuned up the numbers to make the effect larger than the startup noise of rustc, but the asymptotic difference should hold for smaller n as well.
Note: current state of the PR omits error messages if there are other errors present already. For now, I'm mainly interested in a perf run to study whether this issue is present at all. Please queue one for this PR. Thanks!